ML2 Final¶

Datasource¶

Source: Kaggle - Unsupervised Learning on Country Data

Clustering the Countries by using Unsupervised Learning for HELP International. Objective: to categorise the countries using socio-economic and health factors that determine the overall development of the country.

About the organization: HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of less-developed countries with basic amenities and relief during times of disasters and natural calamities.

Problem Statement:¶

HELP International has been able to raise around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries in the direst need of aid. Your job as a data scientist is therefore to categorise the countries using socio-economic and health factors that determine overall development, and then to suggest the countries the CEO should focus on the most.

We will try to categorise countries by income group to determine which ones need the most help, and compare the end result against the World Bank's "World by Income and Region" data. We aim to divide countries into 4 groups:

  • High Income
  • Upper Middle Income
  • Lower Middle Income
  • Low Income

Step 1: Importing libraries and loading data¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px

from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from kneed import KneeLocator
warnings.filterwarnings('ignore') # Ignore warnings
In [2]:
df = pd.read_csv('Country-data.csv')
display(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     167 non-null    object 
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64  
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
None
In [3]:
df.describe(include='all')
Out[3]:
country child_mort exports health imports income inflation life_expec total_fer gdpp
count 167 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000
unique 167 NaN NaN NaN NaN NaN NaN NaN NaN NaN
top Afghanistan NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN 38.270060 41.108976 6.815689 46.890215 17144.688623 7.781832 70.555689 2.947964 12964.155689
std NaN 40.328931 27.412010 2.746837 24.209589 19278.067698 10.570704 8.893172 1.513848 18328.704809
min NaN 2.600000 0.109000 1.810000 0.065900 609.000000 -4.210000 32.100000 1.150000 231.000000
25% NaN 8.250000 23.800000 4.920000 30.200000 3355.000000 1.810000 65.300000 1.795000 1330.000000
50% NaN 19.300000 35.000000 6.320000 43.300000 9960.000000 5.390000 73.100000 2.410000 4660.000000
75% NaN 62.100000 51.350000 8.600000 58.750000 22800.000000 10.750000 76.800000 3.880000 14050.000000
max NaN 208.000000 200.000000 17.900000 174.000000 125000.000000 104.000000 82.800000 7.490000 105000.000000
In [4]:
df.head()
Out[4]:
country child_mort exports health imports income inflation life_expec total_fer gdpp
0 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
1 Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
2 Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
3 Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
4 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200
In [5]:
df.isna().sum()
Out[5]:
country       0
child_mort    0
exports       0
health        0
imports       0
income        0
inflation     0
life_expec    0
total_fer     0
gdpp          0
dtype: int64

EDA¶

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt

numeric_df = df.drop('country', axis=1)

for col in numeric_df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    plt.suptitle(f'{col.upper()} Distribution & Outliers', fontsize=16, fontweight='bold')

    # Boxplot
    sns.boxplot(x=col, data=numeric_df, ax=axes[0], color='lightblue')
    axes[0].set_title('Boxplot', fontsize=12)
    
    # Distribution Plot
    sns.histplot(numeric_df[col], kde=True, ax=axes[1], color='salmon')
    axes[1].set_title('Distribution', fontsize=12)
    
    # Add skewness info
    skewness = numeric_df[col].skew()
    axes[1].annotate(f'Skewness: {skewness:.2f}', xy=(0.7, 0.9), xycoords='axes fraction', fontsize=10,
                     bbox=dict(facecolor='white', alpha=0.7))
    
    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()
In [7]:
# Calculate the correlation matrix
corr = numeric_df.corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create the heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)

# Add title
plt.title('Correlation Heatmap of Country Statistics', fontsize=16)

# Show plot
plt.show()

EDA Takeaways¶

  • The data is clean: no missing values
  • No outliers severe enough to require treatment
  • Notable positive correlations:
    • Income - GDP per capita: 0.90 => higher GDP per capita goes with higher income
    • Total fertility rate - child mortality: 0.85 => higher fertility goes with higher child mortality
    • Imports - exports: 0.85 => countries that import more also tend to export more
  • Notable negative correlations:
    • Life expectancy - child mortality: -0.89 => higher life expectancy is associated with lower child mortality
    • Total fertility rate - life expectancy: -0.76 => higher life expectancy is associated with fewer births
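The strongest pairs quoted above can be pulled out of the correlation matrix programmatically. A minimal sketch on a toy frame standing in for `numeric_df` (the values here are hypothetical, only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for numeric_df (hypothetical values)
data = pd.DataFrame({
    'income':     [1000, 5000, 20000, 50000],
    'gdpp':       [500, 4000, 15000, 45000],
    'child_mort': [120.0, 60.0, 15.0, 4.0],
})

corr = data.corr()
# Keep the upper triangle only, so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head(3))  # strongest correlations first, by absolute value
```

On the real `numeric_df` this would surface the income-gdpp, total_fer-child_mort, and life_expec-child_mort pairs listed above.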
In [8]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_df = pd.DataFrame(scaler.fit_transform(numeric_df), columns=numeric_df.columns)
normalized_df.head()
Out[8]:
child_mort exports health imports income inflation life_expec total_fer gdpp
0 0.426485 0.049482 0.358608 0.257765 0.008047 0.126144 0.475345 0.736593 0.003073
1 0.068160 0.139531 0.294593 0.279037 0.074933 0.080399 0.871795 0.078864 0.036833
2 0.120253 0.191559 0.146675 0.180149 0.098809 0.187691 0.875740 0.274448 0.040365
3 0.566699 0.311125 0.064636 0.246266 0.042535 0.245911 0.552268 0.790221 0.031488
4 0.037488 0.227079 0.262275 0.338255 0.148652 0.052213 0.881657 0.154574 0.114242

PCA¶

  • We perform PCA with all 9 components to inspect the explained variance
  • 6 principal components explain up to 97% of the variance
  • The 6 PCs are mutually uncorrelated, as intended
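The 6-component choice can also be derived programmatically from the cumulative explained variance. A sketch using the ratios from this notebook's run (rounded):

```python
import numpy as np

# Cumulative explained variance from the 9-component PCA fit in this notebook
cumulative = np.array([0.5500, 0.6839, 0.8069, 0.9044, 0.9421, 0.9723,
                       0.9842, 0.9931, 1.0000])

# Smallest number of components whose cumulative variance reaches 97%
n_components = int(np.argmax(cumulative >= 0.97)) + 1
print(n_components)  # 6
```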
In [9]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Apply PCA with 9 components
pca = PCA(n_components=9)
pca_components = pca.fit_transform(normalized_df)

# Create a DataFrame for the PCA results
pca_df_9 = pd.DataFrame(data=pca_components, columns=[f'PC{i+1}' for i in range(9)])
pca_df_9['country'] = df['country']  # Optional: add country column back
In [10]:
# Explained variance ratio for each component
explained_variance = pca.explained_variance_ratio_

plt.figure(figsize=(10, 6))
plt.plot(range(1, 10), explained_variance, marker='o', linestyle='--', color='b')
plt.title('Explained Variance by Each Principal Component', fontsize=16)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.xticks(range(1, 10))
plt.grid(True)
plt.show()
In [11]:
# Cumulative sum of explained variance
cumulative_variance = explained_variance.cumsum()
print(cumulative_variance)
plt.figure(figsize=(10, 6))
plt.plot(range(1, 10), cumulative_variance, marker='o', linestyle='--', color='green')
plt.title('Cumulative Explained Variance by Principal Components', fontsize=16)
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.xticks(range(1, 10))
plt.grid(True)
plt.show()
[0.55001227 0.6838601  0.80687063 0.9043611  0.94214073 0.97227732
 0.98418166 0.99305958 1.        ]
In [12]:
from sklearn.decomposition import IncrementalPCA

# Initialize Incremental PCA
ipca = IncrementalPCA(n_components=6)

# Fit and transform
ipca_components = ipca.fit_transform(normalized_df)

# Create a DataFrame for IPCA results
ipca_df = pd.DataFrame(data=ipca_components, columns=[f'PC{i+1}' for i in range(6)])
ipca_df['country'] = df['country']  # Optional: Add back 'country'

ipca_df.head()
Out[12]:
PC1 PC2 PC3 PC4 PC5 PC6 country
0 0.599055 0.095593 0.157410 0.024592 0.044905 -0.044183 Afghanistan
1 -0.158467 -0.212146 -0.064079 0.061208 -0.014380 -0.013885 Albania
2 -0.003676 -0.135878 -0.134090 -0.133633 0.091475 0.024286 Algeria
3 0.650256 0.275948 -0.142585 -0.156112 0.082833 0.030830 Angola
4 -0.200710 -0.064655 -0.100702 0.037951 0.035631 -0.057256 Antigua and Barbuda
In [13]:
explained_variance_ipca = ipca.explained_variance_ratio_

# Plot explained variance
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(range(1, 7), explained_variance_ipca.cumsum(), marker='o', linestyle='--', color='purple')
plt.title('Cumulative Explained Variance by Incremental PCA Components', fontsize=16)
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.xticks(range(1, 7))
plt.grid(True)
plt.show()
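IncrementalPCA exists mainly for datasets too large to fit in memory: it fits in mini-batches via `partial_fit` and approximates ordinary PCA, so with only 167 rows it is effectively interchangeable with `PCA` here. A minimal sketch on synthetic stand-in data (the shape matches `normalized_df`; the values do not):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical stand-in for normalized_df: 167 rows, 9 columns
rng = np.random.default_rng(0)
X = rng.normal(size=(167, 9))

ipca = IncrementalPCA(n_components=6)
# Equivalent to ipca.fit(X), but processes 50-row chunks at a time
# (each chunk must contain at least n_components rows)
for start in range(0, len(X), 50):
    ipca.partial_fit(X[start:start + 50])

components = ipca.transform(X)
print(components.shape)  # (167, 6)
```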
In [14]:
import numpy as np
import seaborn as sns

final_pca = ipca.fit_transform(normalized_df)

# Calculate the correlation matrix of the PCA components
pc = np.transpose(final_pca)  # Transpose to get components in rows
corrmat = np.corrcoef(pc)

# Plot heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corrmat, annot=True, fmt=".2f", linewidth=0.75, cmap="Blues")
plt.title('Correlation Matrix of PCA Components')
plt.show()

K Means Clustering¶

  • We run K-means across a range of n_clusters values and PCA component counts
  • The inertia (elbow) curves suggest the best number of clusters is 3-4

Dividing into real groups¶

  • We go with 4 clusters, map them to 4 income groups, and compare the result against World Bank data ("World by Income and Region").
  • The result is mostly consistent with the World Bank's classification.
  • Some countries (e.g., Mexico) are unclassified because they are missing from the dataset.
In [15]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

cluster_range = range(1, 9)  # Clusters from 1 to 8
pca_components_range = range(2, 7)  # PCA components from 2 to 6

# Collect inertia for every combination of PCA components and cluster count
inertia_values = []
for n_components in pca_components_range:
    pca = PCA(n_components=n_components)
    pca_data = pca.fit_transform(final_pca)  # Re-project the already-reduced IPCA output

    for n_clusters in cluster_range:
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        kmeans.fit(pca_data)
        inertia_values.append((n_components, n_clusters, kmeans.inertia_))

# DataFrame of inertia for each combination of PCA components and clusters
inertia_df = pd.DataFrame(inertia_values, columns=['PCA Components', 'Clusters', 'Inertia'])

# Plot the inertia curves, one line per PCA component count
plt.figure(figsize=(10, 6))
sns.lineplot(x='Clusters', y='Inertia', hue='PCA Components', data=inertia_df, marker='o')
plt.title('Inertia vs Number of Clusters and PCA Components')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()
In [28]:
from sklearn.cluster import KMeans
import pandas as pd

# Perform K-means clustering with 4 clusters (chosen from the elbow analysis above)
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(final_pca)

# Get the cluster labels (the tier for each country)
country_clusters = kmeans.labels_

# Assuming the 'df' contains a 'country' column (which holds the names of the countries)
# Add the cluster labels (tiers) as a new column in the dataframe
df['Cluster_Tier'] = country_clusters

# Map the cluster labels to descriptive names (optional)
tier_mapping = {0: 'Tier 1', 1: 'Tier 2', 2: 'Tier 3', 3: 'Tier 4'}
df['Cluster_Tier_Descriptive'] = df['Cluster_Tier'].map(tier_mapping)

# Calculate the average GDP per capita for each tier (cluster)
avg_gdp_per_tier = df.groupby('Cluster_Tier_Descriptive')['gdpp'].mean().reset_index()

# Sort the tiers based on average GDP, highest to lowest
avg_gdp_per_tier_sorted = avg_gdp_per_tier.sort_values(by='gdpp', ascending=False)

# Assign income status based on sorted average GDP per capita:
# the highest maps to 'High Income', down through the tiers to 'Low Income'
Income_Status_mapping = {
    avg_gdp_per_tier_sorted.iloc[0]['Cluster_Tier_Descriptive']: 'High Income',
    avg_gdp_per_tier_sorted.iloc[1]['Cluster_Tier_Descriptive']: 'Upper Middle Income',
    avg_gdp_per_tier_sorted.iloc[2]['Cluster_Tier_Descriptive']: 'Lower Middle Income',
    avg_gdp_per_tier_sorted.iloc[3]['Cluster_Tier_Descriptive']: 'Low Income'
}

# Map the 'Income_Status' column based on the sorted tiers
df['Income_Status'] = df['Cluster_Tier_Descriptive'].map(Income_Status_mapping)

# Display the countries grouped by their development status
tier_groups = df.groupby('Income_Status')['country'].apply(list).reset_index()

print(tier_groups)

print(df[['country', 'Cluster_Tier_Descriptive', 'Income_Status']].head(20))
         Income_Status                                            country
0          High Income  [Australia, Austria, Belgium, Brunei, Canada, ...
1           Low Income  [Afghanistan, Angola, Benin, Burkina Faso, Bur...
2  Lower Middle Income  [Algeria, Bangladesh, Bhutan, Bolivia, Botswan...
3  Upper Middle Income  [Albania, Antigua and Barbuda, Argentina, Arme...
                country Cluster_Tier_Descriptive        Income_Status
0           Afghanistan                   Tier 4           Low Income
1               Albania                   Tier 3  Upper Middle Income
2               Algeria                   Tier 1  Lower Middle Income
3                Angola                   Tier 4           Low Income
4   Antigua and Barbuda                   Tier 3  Upper Middle Income
5             Argentina                   Tier 3  Upper Middle Income
6               Armenia                   Tier 3  Upper Middle Income
7             Australia                   Tier 2          High Income
8               Austria                   Tier 2          High Income
9            Azerbaijan                   Tier 3  Upper Middle Income
10              Bahamas                   Tier 3  Upper Middle Income
11              Bahrain                   Tier 3  Upper Middle Income
12           Bangladesh                   Tier 1  Lower Middle Income
13             Barbados                   Tier 3  Upper Middle Income
14              Belarus                   Tier 3  Upper Middle Income
15              Belgium                   Tier 2          High Income
16               Belize                   Tier 3  Upper Middle Income
17                Benin                   Tier 4           Low Income
18               Bhutan                   Tier 1  Lower Middle Income
19              Bolivia                   Tier 1  Lower Middle Income
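Cell [1] also imports `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score`; they offer a quick sanity check of the k=4 choice. A sketch on synthetic stand-in data (in the notebook, `X` and `labels` would be `final_pca` and `kmeans.labels_`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Hypothetical stand-in for final_pca: 4 well-separated blobs in 6 dimensions
rng = np.random.default_rng(0)
centers = 6 * np.eye(4, 6)  # 4 centers, pairwise distance 6*sqrt(2)
X = np.vstack([c + rng.normal(scale=0.4, size=(40, 6)) for c in centers])

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print(f"silhouette:        {silhouette_score(X, labels):.3f}")         # higher is better
print(f"calinski-harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better
print(f"davies-bouldin:    {davies_bouldin_score(X, labels):.3f}")     # lower is better
```

Comparing these scores for k=3 vs k=4 on `final_pca` would give a more principled basis for the cluster count than inertia alone.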
In [33]:
fig = px.choropleth(df,
                    locationmode = 'country names',
                    locations = 'country',
                    color = 'Income_Status',
                    color_discrete_map = {'High Income': 'Green',
                                          'Upper Middle Income': 'LightGreen',
                                          'Lower Middle Income': 'Orange',
                                          'Low Income': 'Red'}
                   )

fig.update_layout(
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=2,
            ),
    )
fig.show()

World by Income and Region - The World Bank Data
